Task: A simple yet powerful marketing technique is RFM analysis, which uses recency (how recently the customer made their last purchase), frequency (how often the customer purchased in a given period), and monetary value (how much the customer spent in a given period) to identify the best customers and run targeted marketing campaigns.
As a data scientist, you are asked to segment the customers using transaction data and profile the segments based on their characteristics (recency, frequency, monetary). After you find the segments, give them understandable names so the marketing team can easily create campaign strategies.
Data: ../data/transactions.csv
Hints: For each customer id, compute the time difference between their last transaction and today. Also calculate the number of transactions and the total amount spent. You are allowed to use SQL.
Output: Push the executed notebook to your GitHub repo and submit the URL to the class leader no later than August 21, 2021. Note that the notebook must contain exploratory analysis and clustering, as well as a story about your findings.
Good luck!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import plotly.express as px
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, homogeneity_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from scipy.cluster.hierarchy import dendrogram, linkage
import warnings
warnings.filterwarnings("ignore")
import os
transaction_df = pd.read_csv(r'data/transactions.csv')
transaction_df.head()
|  | customer_id | trans_date | trans_amount |
|---|---|---|---|
| 0 | CS5295 | 11-Feb-18 | 35 |
| 1 | CS4768 | 15-Mar-20 | 39 |
| 2 | CS2122 | 26-Feb-18 | 52 |
| 3 | CS1217 | 16-Nov-16 | 99 |
| 4 | CS1850 | 20-Nov-18 | 78 |
#Check column data types
transaction_df.dtypes
customer_id      object
trans_date       object
trans_amount      int64
dtype: object
#We want to change trans_date type to datetime format
pd.to_datetime(transaction_df['trans_date'])
ParserError: day is out of range for month: 29-Feb-17
(full traceback truncated)
transaction_df[transaction_df['trans_date']=='29-Feb-17']
|  | customer_id | trans_date | trans_amount |
|---|---|---|---|
| 3460 | CS3985 | 29-Feb-17 | 40 |
| 6037 | CS5930 | 29-Feb-17 | 68 |
| 6732 | CS4670 | 29-Feb-17 | 100 |
| 8421 | CS2946 | 29-Feb-17 | 65 |
| 9102 | CS5458 | 29-Feb-17 | 73 |
| ... | ... | ... | ... |
| 116145 | CS8206 | 29-Feb-17 | 54 |
| 117817 | CS8352 | 29-Feb-17 | 79 |
| 117855 | CS7741 | 29-Feb-17 | 35 |
| 118973 | CS7897 | 29-Feb-17 | 65 |
| 119689 | CS7176 | 29-Feb-17 | 18 |
90 rows × 3 columns
The output shows that the trans_date column contains the invalid date 29-Feb-17 (2017 was not a leap year), so we drop the rows containing it.
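Instead of matching the bad string by hand, pandas can flag every unparseable date for us. A minimal sketch on a toy frame (column names follow the dataset above; the customer ids are illustrative):

```python
import pandas as pd

# Toy frame with one impossible date (2017 was not a leap year)
df = pd.DataFrame({
    'customer_id': ['CS0001', 'CS0002', 'CS0003'],
    'trans_date': ['11-Feb-18', '29-Feb-17', '15-Mar-20'],
    'trans_amount': [35, 40, 39],
})

# errors='coerce' turns unparseable strings into NaT instead of raising
df['trans_date'] = pd.to_datetime(df['trans_date'], format='%d-%b-%y', errors='coerce')

invalid = df['trans_date'].isna()
print(invalid.sum())        # how many rows had invalid dates
df = df[~invalid]           # keep only rows with valid dates
```

This would also catch any other malformed dates the string match misses.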
# Data Cleansing
transaction_df_2 = transaction_df.drop(transaction_df[transaction_df['trans_date']=='29-Feb-17'].index)
transaction_df_2
|  | customer_id | trans_date | trans_amount |
|---|---|---|---|
| 0 | CS5295 | 11-Feb-18 | 35 |
| 1 | CS4768 | 15-Mar-20 | 39 |
| 2 | CS2122 | 26-Feb-18 | 52 |
| 3 | CS1217 | 16-Nov-16 | 99 |
| 4 | CS1850 | 20-Nov-18 | 78 |
| ... | ... | ... | ... |
| 124995 | CS8433 | 26-Jun-16 | 64 |
| 124996 | CS7232 | 19-Aug-19 | 38 |
| 124997 | CS8731 | 28-Nov-19 | 42 |
| 124998 | CS8133 | 14-Dec-18 | 13 |
| 124999 | CS7996 | 13-Dec-19 | 36 |
124910 rows × 3 columns
# Convert trans_date to datetime; an explicit format is faster and avoids ambiguity
transaction_df_2['trans_date'] = pd.to_datetime(transaction_df_2['trans_date'], format='%d-%b-%y')
#Check column data types
transaction_df_2.dtypes
customer_id             object
trans_date      datetime64[ns]
trans_amount             int64
dtype: object
transaction_df_2
|  | customer_id | trans_date | trans_amount |
|---|---|---|---|
| 0 | CS5295 | 2018-02-11 | 35 |
| 1 | CS4768 | 2020-03-15 | 39 |
| 2 | CS2122 | 2018-02-26 | 52 |
| 3 | CS1217 | 2016-11-16 | 99 |
| 4 | CS1850 | 2018-11-20 | 78 |
| ... | ... | ... | ... |
| 124995 | CS8433 | 2016-06-26 | 64 |
| 124996 | CS7232 | 2019-08-19 | 38 |
| 124997 | CS8731 | 2019-11-28 | 42 |
| 124998 | CS8133 | 2018-12-14 | 13 |
| 124999 | CS7996 | 2019-12-13 | 36 |
124910 rows × 3 columns
#Check Duplicate data
transaction_df_2.duplicated().sum()
6
#Drop duplicate data
transaction_df_2=transaction_df_2.drop_duplicates()
#Check Duplicate data
transaction_df_2.duplicated().sum()
0
#Check Null Value
transaction_df_2.isnull().values.any()
False
today = pd.Timestamp.today()
# Recency: days between today and each transaction
transaction_df_2['Recency'] = (today - transaction_df_2['trans_date']).dt.days
# Per customer, recency = days since the most recent transaction
recency = transaction_df_2.groupby('customer_id')[['Recency']].min()
recency
| customer_id | Recency |
|---|---|
| CS1112 | 577 |
| CS1113 | 551 |
| CS1114 | 548 |
| CS1115 | 526 |
| CS1116 | 719 |
| ... | ... |
| CS8996 | 613 |
| CS8997 | 777 |
| CS8998 | 600 |
| CS8999 | 773 |
| CS9000 | 532 |
6889 rows × 1 columns
# Frequency: number of transactions per customer
frequency = transaction_df_2.groupby('customer_id')[['trans_date']].count()
frequency.columns = ['frequency']
frequency
| customer_id | frequency |
|---|---|
| CS1112 | 15 |
| CS1113 | 20 |
| CS1114 | 19 |
| CS1115 | 22 |
| CS1116 | 13 |
| ... | ... |
| CS8996 | 13 |
| CS8997 | 14 |
| CS8998 | 13 |
| CS8999 | 12 |
| CS9000 | 13 |
6889 rows × 1 columns
# Monetary: total amount spent per customer
monetary = transaction_df_2.groupby('customer_id')[['trans_amount']].sum()
monetary
| customer_id | trans_amount |
|---|---|
| CS1112 | 1012 |
| CS1113 | 1490 |
| CS1114 | 1432 |
| CS1115 | 1659 |
| CS1116 | 857 |
| ... | ... |
| CS8996 | 582 |
| CS8997 | 543 |
| CS8998 | 624 |
| CS8999 | 383 |
| CS9000 | 533 |
6889 rows × 1 columns
#Join recency and frequency
data = recency.join(frequency)
#Join data and monetary
data = data.join(monetary)
data
| customer_id | Recency | frequency | trans_amount |
|---|---|---|---|
| CS1112 | 577 | 15 | 1012 |
| CS1113 | 551 | 20 | 1490 |
| CS1114 | 548 | 19 | 1432 |
| CS1115 | 526 | 22 | 1659 |
| CS1116 | 719 | 13 | 857 |
| ... | ... | ... | ... |
| CS8996 | 613 | 13 | 582 |
| CS8997 | 777 | 14 | 543 |
| CS8998 | 600 | 13 | 624 |
| CS8999 | 773 | 12 | 383 |
| CS9000 | 532 | 13 | 533 |
6889 rows × 3 columns
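The three aggregations and joins above can also be collapsed into a single `groupby().agg()` call. A sketch on a toy frame with a fixed reference date for reproducibility (column names follow the dataset above; ids and dates are illustrative):

```python
import pandas as pd

today = pd.Timestamp('2021-08-01')   # fixed reference date for reproducibility

# Toy frame mirroring the transactions data
toy = pd.DataFrame({
    'customer_id': ['CS1112', 'CS1112', 'CS1113'],
    'trans_date': pd.to_datetime(['2018-02-11', '2020-03-15', '2018-02-26']),
    'trans_amount': [35, 39, 52],
})

# Recency, frequency and monetary in a single pass
rfm = (toy.groupby('customer_id')
          .agg(Recency=('trans_date', lambda s: (today - s.max()).days),
               frequency=('trans_date', 'count'),
               trans_amount=('trans_amount', 'sum')))
print(rfm)
```

Named aggregation keeps the column names explicit and avoids the drop/join bookkeeping.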
data.shape
(6889, 3)
data.describe()
|  | Recency | frequency | trans_amount |
|---|---|---|---|
| count | 6889.000000 | 6889.000000 | 6889.000000 |
| mean | 596.349252 | 18.130933 | 1178.359123 |
| std | 85.545443 | 5.192154 | 465.606743 |
| min | 515.000000 | 4.000000 | 149.000000 |
| 25% | 538.000000 | 14.000000 | 781.000000 |
| 50% | 569.000000 | 18.000000 | 1227.000000 |
| 75% | 627.000000 | 22.000000 | 1520.000000 |
| max | 1373.000000 | 39.000000 | 2933.000000 |
# Check inter-variable correlation
sns.pairplot(data, aspect=1.5)
plt.show()
The pairplot shows a strong positive correlation between frequency and trans_amount, which is expected: more transactions generally mean more total spending.
standardize = StandardScaler()
data_std = pd.DataFrame(standardize.fit_transform(data))
data_std.columns = data.columns
data_std
|  | Recency | frequency | trans_amount |
|---|---|---|---|
| 0 | -0.226203 | -0.603056 | -0.357321 |
| 1 | -0.530157 | 0.360005 | 0.669371 |
| 2 | -0.565229 | 0.167393 | 0.544793 |
| 3 | -0.822421 | 0.745230 | 1.032364 |
| 4 | 1.433854 | -0.988281 | -0.690244 |
| ... | ... | ... | ... |
| 6884 | 0.194656 | -0.988281 | -1.280914 |
| 6885 | 2.111905 | -0.795668 | -1.364682 |
| 6886 | 0.042679 | -0.988281 | -1.190703 |
| 6887 | 2.065143 | -1.180893 | -1.708345 |
| 6888 | -0.752278 | -0.988281 | -1.386161 |
6889 rows × 3 columns
Let's first segment the data based on two variables: Recency and frequency.
data_is = data_std[['Recency' , 'frequency']].values
data_is
array([[-0.22620325, -0.60305619],
[-0.53015733, 0.36000517],
[-0.56522896, 0.1673929 ],
...,
[ 0.04267922, -0.98828073],
[ 2.06514296, -1.180893 ],
[-0.75227763, -0.98828073]])
k = 3
k_means = KMeans(init = "k-means++", n_clusters = k, random_state= 123)
k_means.fit(data_is)
centroids = k_means.cluster_centers_
centroids
array([[-0.28626929, -0.69791183],
[-0.3431659 , 0.88239681],
[ 1.9265558 , -0.73311912]])
Segment = k_means.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(x=data['Recency'], y=data['frequency'], hue=Segment,
                palette=sns.color_palette('hls', k))
plt.title('KMeans with {} Clusters'.format(k))
plt.show()
`k_means.cluster_centers_` returns the centroids in standardized units. We can recover the centroids in the original units by averaging the original data per cluster:
centroids = data[['Recency','frequency']].groupby(Segment, as_index=False).mean()
centroids
|  | Recency | frequency |
|---|---|---|
| 0 | 571.861996 | 14.507531 |
| 1 | 566.995104 | 22.712141 |
| 2 | 761.145361 | 14.324742 |
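Equivalently, since `StandardScaler` z-scores each column independently, standardized centroids can be mapped back with the scaler's fitted `mean_` and `scale_`. A sketch on synthetic data (the `[:2]` slice assumes Recency and frequency were the first two columns given to the scaler, as above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RFM table (Recency, frequency, trans_amount)
rng = np.random.default_rng(0)
X = rng.normal(loc=[600, 18, 1200], scale=[85, 5, 450], size=(100, 3))

scaler = StandardScaler().fit(X)

# Pretend these are centroids in standardized (Recency, frequency) space
centroids_std = np.array([[-0.3, 0.9], [1.9, -0.7]])

# Undo the z-scoring column by column: x = z * std + mean
centroids_orig = centroids_std * scaler.scale_[:2] + scaler.mean_[:2]
print(centroids_orig)
```

Both routes give centroids in interpretable units (days, transaction counts).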
def k_means_inertia_silhouette(data, k_max=9, init='k-means++'):
    inertia = []
    silhouette = []
    for k in range(2, k_max):
        km = KMeans(init=init, n_clusters=k, random_state=123)
        km.fit(data)
        inertia.append(km.inertia_)
        silhouette.append(silhouette_score(data, km.labels_))
    plt.figure(figsize=(15, 6))
    plt.plot(np.arange(2, k_max), inertia, 'o')
    plt.plot(np.arange(2, k_max), inertia, '-', alpha=0.5)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Inertia')
    plt.title('The elbow method using inertia for each number of clusters')
    plt.show()
    plt.figure(figsize=(15, 6))
    plt.plot(np.arange(2, k_max), silhouette, 'o')
    plt.plot(np.arange(2, k_max), silhouette, '-', alpha=0.5)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Silhouette')
    plt.title('Silhouette score for each number of clusters')
    plt.show()
    return inertia, silhouette
inertia, silhouette = k_means_inertia_silhouette(data = data_is)
# Instantiate a fresh clustering model and the elbow visualizer
model = KMeans(random_state=123)
visualizer = KElbowVisualizer(model, k=(2, 10))
visualizer.fit(data_is)    # Fit the 2-feature (Recency, frequency) data
visualizer.show()          # Draw the elbow plot
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
k = 4
k_means = KMeans(init = "k-means++", n_clusters = k, random_state= 123)
k_means.fit(data_is)
centroids = k_means.cluster_centers_
centroids
array([[-0.41825981, 0.91510632],
[ 0.94502634, -0.37751625],
[ 3.13182193, -1.1681332 ],
[-0.43757019, -0.72787216]])
Segment = k_means.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(x=data['Recency'], y=data['frequency'], hue=Segment,
                palette=sns.color_palette('hls', k))
plt.title('KMeans with {} Clusters'.format(k))
plt.show()
centroids = data[['Recency','frequency']].groupby(Segment, as_index=False).mean()
centroids
|  | Recency | frequency |
|---|---|---|
| 0 | 560.571629 | 22.881961 |
| 1 | 677.238095 | 16.166289 |
| 2 | 864.617089 | 12.072785 |
| 3 | 558.919833 | 14.351983 |
inertia, silhouette = k_means_inertia_silhouette(data_std)
# Instantiate the clustering model and visualizer
model = KMeans(random_state=123)
visualizer = KElbowVisualizer(model, k=(2, 10))
visualizer.fit(data_std)   # Fit the full standardized RFM data
visualizer.show()          # Draw the elbow plot
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
k = 4
k_means = KMeans(init = "k-means++", n_clusters = k, random_state= 123)
k_means.fit(data_std)
KMeans(n_clusters=4, random_state=123)
Segment = k_means.labels_
data_std.groupby(Segment, as_index=False).mean()
|  | Recency | frequency | trans_amount |
|---|---|---|---|
| 0 | -0.191317 | -1.038916 | -1.148154 |
| 1 | -0.351725 | 1.270868 | 1.222335 |
| 2 | 2.331673 | -0.951922 | -0.915636 |
| 3 | -0.198772 | 0.085647 | 0.178141 |
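The assignment asks for names the marketing team can act on. One hedged sketch, reading the standardized means above (label 1 scores high on frequency and spend, label 2 has very high recency, labels 0 and 3 sit below and around the average); the exact names are an illustrative assumption:

```python
import pandas as pd

# Illustrative mapping from k-means label to a marketing-friendly name,
# chosen by inspecting the standardized RFM means per cluster above
segment_names = {
    0: 'Low-value occasional',    # low frequency and spend
    1: 'Best customers',          # high frequency and spend
    2: 'At risk / churned',       # very high recency, low activity
    3: 'Regular',                 # close to the average on all three
}

labels = pd.Series([0, 1, 2, 3, 1])      # example cluster labels
print(labels.map(segment_names).tolist())
```

Attaching the mapped names as a column makes the segment profile tables self-explanatory for a campaign brief.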
data['Segment'] = list(map(str, Segment))
fig = px.scatter_3d(data, x='Recency', y='frequency', z='trans_amount', color='Segment',
opacity = 0.8, height=800)
fig.show() # to show as static image: fig.show('svg')
data_std_2 = data_std.copy()
data_std_2['cluster']= k_means.labels_
melted_RFM = pd.melt(data_std_2.reset_index(),
                     id_vars=['cluster'],
                     value_vars=['Recency', 'frequency', 'trans_amount'],
                     var_name='Features',
                     value_name='Value'
                     )
sns.lineplot(x='Features', y='Value', hue='cluster', data=melted_RFM)
plt.legend()
<matplotlib.legend.Legend at 0x1d5673da250>
agglo_cluster = AgglomerativeClustering(n_clusters = 4, affinity = 'euclidean', linkage = 'complete')
Segment = agglo_cluster.fit_predict(data_std)
centroids = data.groupby(Segment, as_index=False).mean()
centroids
|  | Recency | frequency | trans_amount |
|---|---|---|---|
| 0 | 581.030843 | 18.262321 | 1191.823688 |
| 1 | 809.322581 | 12.135945 | 653.391705 |
| 2 | 580.445161 | 30.251613 | 2152.870968 |
| 3 | 1235.100000 | 7.800000 | 387.800000 |
data_std_3 = data_std.copy()
data_std_3['cluster']=Segment
data['Segment'] = list(map(str, Segment))
fig = px.scatter_3d(data, x='Recency', y='frequency', z='trans_amount', color='Segment',
opacity = 0.8, height=800)
fig.show()
melted_RFM = pd.melt(data_std_3.reset_index(),
                     id_vars=['cluster'],
                     value_vars=['Recency', 'frequency', 'trans_amount'],
                     var_name='Features',
                     value_name='Value'
                     )
sns.lineplot(x='Features', y='Value', hue='cluster', data=melted_RFM)
plt.legend()
<matplotlib.legend.Legend at 0x1d568a988b0>
S=data_std.to_numpy()
S
array([[-0.22620325, -0.60305619, -0.35732128],
[-0.53015733, 0.36000517, 0.66937067],
[-0.56522896, 0.1673929 , 0.54479299],
...,
[ 0.04267922, -0.98828073, -1.19070304],
[ 2.06514296, -1.180893 , -1.7083448 ],
[-0.75227763, -0.98828073, -1.38616113]])
# Try to make estimated number of clusters = 4
db = DBSCAN(eps=0.2, min_samples=14).fit(S)
# Number of clusters in labels, ignoring noise if present.
# the label -1 is considered as noise by the DBSCAN algorithm
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
Estimated number of clusters: 4
Estimated number of noise points: 1484
Segmentdb = db.labels_
data['Segmentdb'] = list(map(str, Segmentdb))
fig = px.scatter_3d(data, x='Recency', y='frequency', z='trans_amount', color='Segmentdb',
opacity = 0.8, height=800)
fig.show() # to show as static image: fig.show('svg')
Forcing roughly 4 clusters with DBSCAN leaves a large share of points labelled as noise (outliers).
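Rather than guessing `eps`, a common heuristic is the k-distance curve: sort every point's distance to its k-th nearest neighbor and read `eps` off the knee. A sketch on synthetic blobs standing in for the standardized data, with `k` mirroring the `min_samples` used above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the standardized RFM matrix
X, _ = make_blobs(n_samples=500, centers=3, random_state=123)

k = 14  # mirror min_samples used above
# +1 because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])   # distance to the k-th real neighbor, sorted

plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}-th nearest neighbor')
plt.title('k-distance curve: read eps at the knee')
plt.show()
```

An `eps` near the knee typically separates dense regions from noise far better than a hand-picked value.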
# K-means
unique, counts = np.unique(k_means.labels_, return_counts=True)
dict(zip(unique, counts))
{0: 1765, 1: 1738, 2: 641, 3: 2745}
# Hierarchical
unique, counts = np.unique(Segment, return_counts=True)
dict(zip(unique, counts))
{0: 6290, 1: 434, 2: 155, 3: 10}
# DbScan
unique, counts = np.unique(db.labels_, return_counts=True)
dict(zip(unique, counts))
{-1: 1484, 0: 5370, 1: 10, 2: 17, 3: 8}
Based on the cluster sizes, K-Means produces the most balanced and interpretable segments on this dataset: hierarchical clustering collapses almost all customers into a single cluster, and DBSCAN labels a large fraction of points as noise.
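The visual comparison can be backed by a number: compute the silhouette score for each method on the same standardized features, excluding DBSCAN's noise points. A hedged sketch on synthetic blobs standing in for `data_std`:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data_std (4 well-separated blobs, standardized)
X, _ = make_blobs(n_samples=600, centers=4, random_state=123)
X = StandardScaler().fit_transform(X)

scores = {}
scores['kmeans'] = silhouette_score(
    X, KMeans(n_clusters=4, random_state=123, n_init=10).fit_predict(X))
scores['hierarchical'] = silhouette_score(
    X, AgglomerativeClustering(n_clusters=4).fit_predict(X))

db_labels = DBSCAN(eps=0.3, min_samples=14).fit_predict(X)
mask = db_labels != -1                     # exclude noise points from the score
if len(set(db_labels[mask])) > 1:
    scores['dbscan'] = silhouette_score(X[mask], db_labels[mask])

print(scores)
```

Reporting the scores alongside the cluster-size table makes the "K-Means is best here" conclusion quantitative.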
!jupyter nbconvert --to html clustering.ipynb